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Abstract 

o 

J We propose a new method to enforce priors on the solution of the nonneg- 

^ ative matrix factorization (NMF). The proposed algorithm can be used for 

denoising or single-channel source separation (SCSS) applications. The NMF 
^ solution is guided to follow the Minimum Mean Square Error (MMSE) esti- 

^ mates under Gaussian mixture prior models (GMM) for the source signal. In 

SCSS applications, the spectra of the observed mixed signal are decomposed 
O as a weighted linear combination of trained basis vectors for each source using 



NMF. In this work, the NMF decomposition weight matrices arc treated as 
a distorted image by a distortion operator, which is learned directly from the 
observed signals. The MMSE estimate of the weights matrix under GMM 
prior and log-normal distribution for the distortion is then found to improve 
the NMF decomposition results. The MMSE estimate is embedded within 
the optimization objective to form a novel regularized NMF cost function. 
The corresponding update rules for the new objectives are derived in this 
paper. Experimental results show that, the proposed regularized NMF al- 
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gorithm improves the source separation performance compared with using 

NMF without prior or with other prior models. 

Keywords: Single channel source separation, nonnegative matrix 

factorization, minimum mean square error estimates, and Gaussian mixture 

models. 



1. Introduction 



Nonnegative matrix factorization (Lee and Seung, 2001) is an important 



tool for source separation applications, especially when only one observation 
of the mixed signal is available. NMF is used to decompose a nonnegative 
matrix into a multiplication of two nonnegative matrices, a basis matrix and 
a gains matrix. The basis matrix contains a set of basis vectors and the gains 
matrix contains the weights corresponding to the basis vectors in the basis 
matrix. The NMF solutions are found by solving an optimization problem 
based on minimizing a predefined cost function. As most optimization prob- 
lems, the main goal in NMF is to find the solutions that minimize the cost 
function without considering any prior information rather than the nonneg- 
ativity constraint. There have been many works that tried to enforce prior 
information related to the nature of the application on the NMF decompo- 
sition results. For audio source separation applications, the continuity and 



sparsity priors were enforced in the NMF decomposition weights (Virtanen 



2007 


). In 


Bertin et al. 


(2009 


), and 


Bertin et al. 


(2010 



monicity priors were enforced on the NMF solution in Bayesian framework 



and applied to music transcription. In Wilson et al. (2008b), and Wilson et al. 



(2008a) the regularized NMF was used to increase the NMF decomposition 
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weights matrix likelihood under a prior Gaussian distribution. In Fevotte 



et al. (2009), Markov chain prior model for smoothness was used within a 



Bayesian framework in regularized NMF with Itakura-Saito (IS-NMF) di- 



vergence. In Virtanen et al. (2008), the conjugate prior distributions on 
the NMF weights and basis matrices solutions with the Poisson observation 
model within Bayesian framework was introduced. The Gamma distribution 



and the Gamma Markov chain (Cemgil and Dikmen, 2007) were used as pri- 



ors for the basis and weights/gains matrices respectively in Virtanen et al. 



(2008). A mixture of Gamma prior model was used as a prior for the basis 



matrix in 



Virtanen and Cemgil (2009). The regularized NMF with smooth- 



ness and spatial decorrelation constraints was used in Chen et al. (2006) for 



EEG applications. In Cichocki et al. (2006), and Chen et al. (2006), a variety 
of constrained NMF algorithms were used for different applications. 

In supervised single channel source separation (SCSS), NMF is used in two 



main stages, the training stage and the separation stage (Schmidt and Olsson 



2006] IGrais and Erdogan] |2011a||b||cl |Grais et al.j |20T2| [G rais and Erdogan] 



2012c). In the training stage, NMF is used to decompose the spectrogram of 



clean training data for the source signals into a multiplication of trained basis 
and weights/gains matrices for each source. The trained basis matrix is used 
as a representative model for the training data of each source and the trained 
gains matrices are usually ignored. In the separation stage, NMF is used to 
decompose the mixed signal spectrogram as a nonnegative weighted linear 
combination of the columns in the trained basis matrices. The spectrogram 
estimate for each source in the mixed signal can be found by summing its 
corresponding trained basis terms from the NMF decomposition during the 
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separation stage. One of the main problems of this framework is that, the 
estimate for each source spectrogram is affected by the other sources in the 
mixed signaL The NMF decomposition of the weight combinations in the 
separation stage needs to be improved. To improve the NMF decomposition 
during the separation stage, prior information about the weight combinations 
for each source can be considered. 

In this work, we introduce a new method of enforcing the NMF solution 
of the weights matrix in the separation stage to follow certain estimated 
patterns. We assume we have prior statistical informations about the solution 
of the NMF weights matrix. The Gaussian mixture model (GMM) is used as 
a prior model for the valid/expected weight combination patterns that can 
exist in the columns of the weights matrix that are related to the nature of the 
source signals. Here, in the training stage, NMF is also used to decompose 
the spectrogram of the training data into trained basis and weights/gains 
matrices for each source. In this work, the trained gains matrix is used 
along with the trained basis matrix to represent each source. We can see the 
columns of the trained gains matrix as valid weight combinations that their 
corresponding bases in the basis matrix can jointly receive for a certain type 
of source signal. The columns of the trained gains matrix can be used to train 
a prior model that captures the statistics of the valid weight combinations 
that the bases can receive. The prior gain model and the trained basis matrix 
for each source can be used to represent each source in the separation stage. 
During the separation stage, the prior model can guide the NMF solution to 
prefer these valid weight patterns. The multivariate Gaussian mixture model 



(GMM) can be used to model the trained gains matrix (Grais and Erdogan 
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2012b). The GMM is a rich model which captures the statistics and the 



correlations of the valid gain combinations for each source signal. GMMs 
are extensively used in speech processing applications like speech recognition 
and speaker verification. GMMs are used to model the multi-modal nature 
in speech feature vectors due to phonetic differences, speaking styles, gender. 



accents (Rabiner and Juang, 1993). We are conjecturing that the weight 
vectors of the NMF gains matrix can be considered as a feature extracted 
from the signal in a frame so that it can be modeled well with a GMM. The 
columns in the trained weights matrix are normalized, and their logarithm 
is then calculated and used to train the GMM prior. The basis matrix and 
the trained GMM prior model for the weights are jointly used as a trained 
representative model for the source training signals. 

In the separation stage and after observing the mixed signal, NMF is 
used again to decompose the mixed signal spectrogram as a weighted linear 
combination of the trained basis vectors for the sources that are involved in 
the observed mixed signal. The conventional NMF solution for the weight 
combinations is found to minimize a predefined NMF cost function ignoring 
that, for each set of trained basis vectors of a certain source signal there is 
a set of corresponding valid weight combinations that the bases can possibly 



receive. In Grais and Erdogan (2012b), the prior GMM that models the valid 
weight combinations for each source is used to guide the NMF solution for the 



gains matrix during the separation stage. The priors in Grais and Erdogan 



(2012b) are enforced by maximizing the log-likelihood of the NMF solution 



with the trained prior GMMs. The priors in Grais and Erdogan (2012b) are 
enforced without evaluating how good the NMF solution is without using the 
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priors. For example, if the NMF solution without prior is not satisfactory, 
we would like to rely more on the priors and vice versa. 

In this work, we introduce a new strategy of applying the priors on the 
NMF solutions of the gain matrix during the separation stage. The new 
strategy is based on evaluating how much the solution of the NMF gains 
matrix needs to rely on the prior GMMs. The NMF solutions without using 
priors for the weights matrix for each source during the separation stage 
can be seen as a deformed image, and its corresponding valid weights/gains 
matrix is needed to be estimated under the GMM prior. The deformation 
operator parameters which measure the uncertainty of the NMF solution of 
the weights matrix are learned directly from the observed mixed signal. The 
uncertainty in this work is a measurement of how far the NMF solution of 
the weights matrix during the separation stage is from being a valid weight 
pattern that is modeled in the prior GMM. The learned uncertainties are used 
with the minimum mean square error (MMSE) estimator to find the estimate 
of the valid weights matrix. The estimated valid weights matrix should also 
consider the minimization of the NMF cost function. To achieve these two 
goals, a regularized NMF is used to consider the vahd weight patterns that 
can appear in the columns of the weights matrix while decreasing the NMF 
cost function. The uncertainties within MMSE estimates of the vahd weight 
combinations are embedded in the regularized NMF cost function for this 
purpose. The uncertainty measurements play very important role in this 
work as we will show in next sections. If the uncertainty of the NMF solution 
of the weights matrix is high, that means the regularized NMF needs more 
support from the prior term. In case of low uncertainty, the regularized 
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NMF needs less support from the prior term. Including the uncertainty 
measurements in the regularization term using MMSE estimate makes the 
proposed regularized NMF algorithm decide automatically how much the 
solution should rely on the prior GMM term. This is the main advantage of 
the proposed regularized NMF compared to the regularization using the log- 



likelihood of the GMM prior or other prior distributions ( Grais and Erdogan 



2012a|[b Canny, 2004). Incorporation of the uncertainties that measure the 



extent of distortion in the NMF weights matrix solutions in the regularization 
term is a main novelty of this work, which has not been seen before in the 
regularization literature. 

The remainder of this paper is organized as follows: In section |2| we give a 
brief explanation about NMF. In section |3| we discuss the problem of single 
channel source separation and its formulation. In Section |4| we show the 
conventional usage of NMF in SCSS problems. Section |5] describes the needs 
for a regularized NMF. Sections [6] to [9] introduce the new regularized NMF 
and how it is used in the SCSS problem, which is the main contribution of 



this paper. Section [TO] indicates the source signal reconstruction after NMF 
decomposition. In the remaining sections, we present our observations and 
the results of our experiments. 



2. Nonnegative matrix factorization 

Nonnegative matrix factorization is used to decompose any nonnegative 
matrix V into a multiplication of a nonnegative basis matrix B and a non- 
negative gains or weights matrix G as follows: 

V ^ BG. (1) 
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The columns of matrix B contain nonnegative basis or dictionary vectors 
that are optimized to allow the data in V to be approximated as a non- 
negative linear combination of its constituent vectors. Each column in the 
gains/weights matrix G contains the set of weight combinations that the 
basis vectors in the basis matrix have for its corresponding column in the V 
matrix. To solve for matrix B and G, different NMF cost functions can be 
used. For audio source separation applications, the Itakura-Saito (IS-NMF) 



divergence cost function (Fevotte et al. , 2009) is usually used. This cost 



function is found to be a good measurement for the perceptual differences 



between different audio signals (Fevotte et al. , 2009 Jaureguiberry et al 



2011). The IS-NMF cost function is defined as: 



min Dis(V\\BG) 
B,G 



(2) 



where 



D 



IS 



{V\\BG) = J2 



V m,n 



log 



The IS-NMF solutions for equation ^ can be computed by alternating mul- 



tiplicative updates of B and G (Fevotte et al. , 2009; Jaureguiberry et al 



2011) as: 



B ^ B® 



(BG) 



.G' 



.G' 



B 



BG' 

T V 



G^G® 



(BG) 



B 



T 1 



(3) 



(4) 



BG 



where the operation ® is an element-wise multiplication, all divisions and (.)^ 
are element-wise operations. In source separation applications, IS-NMF is 
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used with matrices of power spectral densities of the source signals (Fevotte 
et al-l |2009[ [Jaureguiberry et aTj |2011[). 



3. Problem formulation for SCSS 

In single channel source separation (SCSS) problems, the aim is to find 
estimates of source signals that are mixed on a single observation channel 
y{t). This problem is usually solved in the short time Fourier transform 
(STFT) domain. Let Y{t,f) be the STFT of y{t), where t represents the 
frame index and / is the frequency-index. Due to the linearity of the STFT, 
we have: 

Y{tJ) = S^{tJ) + S2{tJ), (5) 

where Si{t, f) and S'2(t, /) are the unknown STFT of the first and second 
sources in the mixed signal. Assuming independence of the sources, we can 
write the power spectral density (PSD) of the measured signal as the sum of 
source signal PSDs as follows: 

alitj) = a!itj) + ajitj), (6) 

where ay{t, f) = E{\Y{t, f)]"^). We can write the PSDs for all frames as a 
spectrogram matrix as follows: 

Y = Si + 52, (7) 

where Si and >S'2 are the unknown spectrograms of the source signals, and 
they need to be estimated using the observed mixed signal and training data 
for each source. The spectrogram of the measured signal Y is calculated by 
taking the squared magnitude of the STFT of the measured signal y(t). 
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4. Conventional NMF for SCSS 

In conventional single channel source separation using NMF without reg- 
ularization ( Grais et al.[|2012 ), there are two main stages to find estimates for 



Si and ^2 in equation ([T]). The first stage is the training stage and the second 
stage is the separation/testing stage. In the training stage, the spectrogram 
^tram eg,ch source is calculated by computing the squared magnitude of 
the STFT of each source training signal. NMF is used to decompose the 
spectrogram into basis and gains matrices as follows: 

5 train ri /strain retrain tj /strain /nX 

1 ^ i5lCri , *2 ~ i>2(j2 5 \°) 

the multiplicative update rules in equations ^ and Q are used to solve 
for Si,S2,Gf''" and G^''''' for both sources. Within each iteration, the 
columns of Bi and B2 are normalized and the matrices G*"^^™ and G2^^^ 
are computed accordingly. The initialization of all matrices Si, -B2, 
and G^2^^^ is done using positive random noise. After finding basis and gains 
matrices for each source training data, the basis matrices are used in the 
mixed signal decomposition as shown in the following sections. All the basis 
matrices Bi and B2 are kept fixed in the remaining sections in this paper. 

In the separation stage after observing the mixed signal y{t), NMF is 
used to decompose the mixed signal spectrogram Y with the trained bases 
matrices Bi and B2 that were found from solving equation (Isj) as follows: 



Y^[Bi,B2]G, or Y ^ [B^ B2 



Gi 
G2 



(9) 



then the corresponding spectrogram estimate for each source can be found 
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as: 

5i = SiGi, §2 = B2G2. (10) 

Let B train = [-^1,-6 2]. The only unknown here is the gains matrix G since 
the matrix Btrain "was found during the training stage and it is fixed in the 
separation stage. The matrix G is a combination of two submatrices as in 
equation ([9]). NMF is used to solve for G in ([o]) using the update rule in 
equation Q and G is initialized with positive random numbers. 

5. Motivation for regularized NMF 

The solution of the gains submatrix Gi in (|9| is affected by the existence 
of the second source in the mixed signal. Also, G2 is affected by the first 
source in the mixed signal. The effect of one source into the gains matrix 
solution of the other source strongly depends on the energy level of each 
source in the mixed signal. Therefore, the estimated spectrograms Si and 



>S'2 in equation (10) that are found from solving G using the update rule in 
^ may contain residual contribution from each other and other distortions. 
To fix this problem, more discriminative constraints must be added to the 
solution of each gains submatrix. The columns of the solution gains subma- 
trix Gi and G2 should form a valid/expected weight combinations for its 
corresponding basis matrix of its corresponding source signal. The informa- 
tion about the valid weight combinations that can exits in the gains matrix 
for a source signal can be found in the gains matrix that was computed from 
the clean training data of the same source. For example, the information 
about valid weight combinations that can exist in the gains matrix Gi in 
equation ^ can be found in its training gains matrix Q^^^^'^ in equation (|8|. 
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The columns of the trained gains matrix Q^^^^^ represent the vahd weight 
combinations that the basis matrix Bi can receive for the first source. Note 
that, the basis matrix Bi is common in the training and separation stages. 
The solution of the gains submatrix Gi in equation ^ should consider the 
prior about the valid combination that is present in its corresponding trained 
gains matrix G*''^™ in equation (p| for the same source. 



In our previous work (Grais and Erdogan, 2012b), data in the training 
gains matrix G*''^™ for source i was modeled using a GMM. The NMF solution 
of the gains matrix during the separation stage was guided by the prior GMM. 
The GMM was learned using the logarithm of the normalized columns of the 
training gains matrix. The NMF solution for the gains matrix during the 
separation stage was enforced to increase its log-likelihood with the trained 
GMM prior using regularized NMF as follows: 

Cold = Djs {Y II BtrainG) — Rold{G), (11) 

where Roid{G) is the weighted sum of the log-likelihoods of the log-normalized 
columns of the gains matrix G. Roid{G) was defined as follows: 

2 

Roid{G) = Y,m^oid{G,), (12) 

i=l 

where VoidiG.,) is the log-likelihood for the submatrix Gj, and rji is the regu- 



larization parameter for source i. The regularization parameter in Grais and 



Erdogan (2012b ) was playing two important roles. The first role was to match 
the scale of the IS-NMF divergence term with the scale of the log-likelihood 
prior term. The second role was to decide how much the regularized NMF 



cost function needs to rely on the prior term. The results in Grais and Er- 



dogan (2012b) show that, when the source i has higher energy level than 
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the other sources, the value of its corresponding regularization parameter rji 
becomes smaller than the values of other regularization parameters for the 
other sources. That can be reformed as follows: when the source has high 



energy level, the gains matrix solution of the regularized NMF in (11) rely 



less on the prior model and vice versa. The values of the regularization pa- 
rameters in 



Grais and Erdogan (2012b) was chosen manually for every energy 



level for each source. In the cases when the conjugate prior models of the 



NMF solutions were used (Virtanen et al. , 2008 Canny, 2004), the hyper- 



parameters of the prior models were also chosen manually. The conjugate 
prior models usually enforced on NMF solutions using a Bayesian framework 



(Fevotte et ah 2009 Virtanen et al. 2008 Canny, 2004). In Grais and Er 



dogan (2012b), it was also shown that, the hyper-parameter choices for the 



conjugate prior models can also depend on the energy level differences of the 
source signals in the mixed signal. 



6. Motivation for the proposed regularized NMF 



In this work, we try to use prior GMMs to guide the solution of the gains 



matrix during the separation stage using regularized NMF as in Grais and 



Erdogan (2012b) but following a totally different regularization strategy. We 



also try to find a way to estimate how much the solution of the regularized 
NMF needs to rely on the prior GMMs automatically not manually as in 



Grais and Erdogan (2012b). The way of finding how much the regularized 



NMF solution of the gains matrix needs to rely on the prior GMM is by 
measuring how far the statistics of the solution of the gains matrix Gi in 
(|9| is from the statistics of the solution of the valid gains matrix solution 
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G^™"^ in ^ for source i. Recall that, the matrix Gf'"''' in ^ contains the 
weight combinations that the columns in the basis matrix Bi can jointly 
receive for the clean data of source i. The data in Q^^^^^ can be used as a 
prior information for what kinds of weight combinations that should exist 
in Gi in (|9| since the matrix Bi is the same in (|8]) and ([9]). The matrix 
Qtram ^ ^gg^ train a prior GMM for the expected (valid) weight 



combinations that can exist in the gains matrix for source i as in Grais and 



Erdogan (2012b). The solution of the gains submatrix Gi in ([9]) can be seen 



as a deformed observation that needs to be restored using MMSE estimate 
under its corresponding GMM prior for source i. How far the statistics of 
the solution of the gains matrix Gi is from the statistics of the solution of 
the valid gains matrix solution Gf^"^ can be seen as how much the gains 
submatrix Gi is deformed. How much deformation exists in the gains matrix 
Gi can be learned directly and the logarithm of this deformation is modeled 
using a Gaussian distribution with zero mean and a diagonal covariance 
matrix ^i. When the deformation or the uncertainty measurement ^i of 
the gain submatrix Gi is high, we expect our target regularized NMF cost 
function to rely more on the prior GMM for source i and vice versa. Based on 
the measurement ^'i, the proposed NMF cost function decides automatically 
how much the solution of the regularized NMF needs to rely on the prior 
GMMs, which is a main advantage of the proposed regularized NMF over our 



previous work (Grais and Erdogan 2012b). Applying the prior information 



on the gains matrix Gi in (|9j) using MMSE estimate under a GMM prior 
using regularized NMF is the new strategy that we introduce in this paper. 
In the following sections, we give more details about training the prior 
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GMM for the gains matrix for each source. Then, we give more details 
about our proposed regularized NMF using MMSE estimate to find better 
solution for the gains matrix in ([o]). In Section [sj we present our proposed 
regularized NMF in a general manner. In Section [8} we assume we have a 
trained basis matrix B, a trained prior GMM for a clean gains matrix, and 
a gains matrix G that inherited some distortion from the original matrix V 
from solving equation ([2]). We introduce our proposed regularized NMF in 
a general fashion in Section |8] to make the idea clearer for different NMF 
applications like, dimensionality reduction, denosing, and other applications. 
The update rules that solve the proposed regularized NMF are also derived 
in Section [8] in a general fashion regardless of the application. The GMM in 
Section |8] is the trained prior GMM that captures the statistics of the valid 
weights combinations that should have been existed in the gains matrix G. In 
section [9| we show how we use the proposed regularized NMF to find better 
solutions for the gain submatrices in equation ^ for our single channel source 
separation problem. 

7. Training the GMM prior models 

We use the gains matrices G*'"'^™ and G^™'^ equation ^ to train prior 
models for the expected/valid weight patterns in the gains matrix for each 
source. For each matrix G^{^^^ and G2^"\ we normalize their columns and 
then calculate their logarithm. The normalization in this paper is done using 
the Euclidean norm. The log-normalized columns are then used to train a 
gains prior GMM for each source. The GMM for a random variable x is 
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defined as: 

K 



A:=l l^^/ l-^fel ^ 

where K is the number of Gaussian mixture components, vr^ is the mixture 
weight, d is the vector dimension, is the mean vector and is the 
diagonal covariance matrix of the k^^ Gaussian model. In training GMM, the 



expectation maximization (EM) algorithm (Dempster et al. , 1977) is used to 
learn the GMM parameters (7rfc,^^,Sfc, \fk = {1,2, K}) for each source 
given its trained gain matrix Q*™"^. The suitable value for K usually depends 
on the nature, dimension and the size of the available training data. We use 
the logarithm because it has been shown that the logarithm of a variable 



taking values between and 1 can be modeled well by a GMM (Wessel et al. 



2000). Since the main goal of the prior model is to capture the statistics 



of the patterns in the trained gains matrix, we use normalization to make 
the prior models insensitive to the energy level of the training data. The 
normalization makes the same prior models applicable for a wide range of 
energy levels and avoids the need to train a different prior model for different 
energy levels. By normalization we are modeling the ratio and correlation 
between the combination of the weights that the bases can jointly receive. 



8. The proposed regularized NMF 

The goal of regularized NMF is to incorporate prior information on the 
solution matrices B and G. In this work, we enforce a statistical prior 
information on the solution of the gains/weights matrix G only. We need 
the solution of the gains matrix G to minimize the IS-divergence cost function 
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in equation ([2]), and the columns of the gains matrix G should form valid 
weight combinations under a prior GMM model. 

The most used strategy for incorporating a prior is by maximizing the 
likelihood of the solution under the prior model while minimizing the NMF 
divergence at the same time. To achieve this, we usually add these two 



objectives in a single cost function. In Grais and Erdogan (2012b), a GMM 
was used as the prior model for the gains matrix, and the solution of the gains 
matrix was encouraged to increase its log-likelihood with the prior model 
using this regularized NMF cost function. The regularization parameters 
in Grais and Erdogan (2012b) were the only tools to control how much the 



regularized NMF relies on the prior models based on the energy differences of 
the sources in the mixed signal. The values of the regularization parameters 
were changed manually in that work. 

Gaussian mixture model is a very general prior model where we can see 
the means of the GMM mixture components as "valid templates" that were 



observed in the training data. Even, Parzen density priors (Kim et al. , 2007) 
can be seen under the same framework. In Parzen density prior estimation, 
training examples are seen as "valid templates" and a fixed variance is as- 
signed to each example. In GMM priors, we learn the templates as cluster 
means from training data and we can also estimate the cluster variances from 
the data. We can think of the GMM prior as a way to encourage the use of 
valid templates or cluster means in the NMF solution during the separation 
stage. This view of the GMM prior will be helpful in understanding the 
MMSE estimate method we introduce in this paper. 

We can find a way of measuring how far the conventional NMF (NMF 
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without prior) solution is from the trained templates in the prior GMM 
and call this the error term. Based on this error, the regularized NMF 
can decide automatically how much the solution of the NMF needs help 
from the prior model. If the conventional NMF solution is far from the 
templates then the regularized NMF will rely more on the prior model. If 
the conventional NMF solution is close to the templates then the regularized 
NMF will rely less on the prior model. By deciding automatically how much 
the regularized NMF needs to rely on the prior we conjecture that, we do not 
need to manually change the values for the regularization parameter based 
on the energy differences of the sources in the mixed signal to improve the 



performance of NMF as in Grais and Erdogan (2012b). 



We use the following way of measuring how far the conventional NMF 
solution is from the prior templates: We can see the solution of the conven- 
tional NMF as distorted observations of a true/valid template. Given the 
prior GMM templates, we can learn a probability distribution model for the 
distortion that captures how far the observations in the conventional gains 
matrix is from the prior GMM. The distortion or the error model can be seen 
as a summary of the distortion that exists in all columns in the gains matrix 
of the NMF solution. 

Based on the prior GMM and the trained distortion model, we can find a 
better estimate for the desired observation for each column in the distorted 
gains matrix. We can mathematically formulate this by seeing the solution 
matrix G that only minimizes the cost function in equation ([2]) as a distorted 



^In this paper, the regularization parameters are chosen once and kept fixed regardless 
of the energy differences of the source signals 
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image where its restored image needs to be estimated. The columns of the 
matrix G are normahzed using the £^ norm and their logarithm is then 
calculated. Let the log- normalized column n namely (log n ^"n ) of the gains 

\\S ri II 2 

matrix be q^. The vector qr„ is treated as a distorted observation as: 

q^ = Xn + e, (14) 

where a;„ is the logarithm of the unknown desired pattern that corresponds 
to the observation qr„ and needs to be estimated under a prior GMM, e is 
the logarithm of the deformation operator, which is modeled by a Gaussian 
distribution with zero mean and diagonal covariance matrix ^ as ^(e|0, 
The GMM prior model for the gains matrix is trained using log-normalized 
columns of the trained gains matrix from training data as shown for example 
in Section [7j The uncertainty ^ is trained directly from all the log-normalized 
columns of the gains matrix q = {q^i, ••, .., ^iv}? where is the number 
of columns in the matrix G. The uncertainty ^ can be iteratively learned 
using the expectation maximization (EM) algorithm. Given the prior GMM 
parameters which are considered fixed here, the update of ^ is found based 



on the sufficient statistics z„ and Rn as follows (Rosti and Gales 



2001 



2004 



Ghahramani and Hinton, 1997) [Appendix A]: 

where the "diag" operator sets all the off-diagonal elements of a matrix to 
zero, A^ is the number of columns in matrix G, and the sufficient statistics 
Zn and Rn can be updated using ^ from the previous iteration as follows: 

K 

Zn = y^^lknZkn, (16) 
fe=l 
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and 



where 



K 



k=l 



Ikn 



*1 



and 



(17) 

(18) 
(19) 

(20) 



= Mfc + 5]fc (Sfc + *) (qr„ - . 

^ is considered as a general uncertainty measurement over all the observa- 
tions in matrix G. ^ can be seen as a model that summarizes the deformation 
that exists in all columns in the gains matrix G. 

Given the GMM prior parameters and the uncertainty measurement ^ , 
the MMSE estimate of each pattern given its observation under the 



observation model in equation ( 14 ) can be found similar to Rosti and Gales 



(2001 2004), and Ghahramani and Hinton (1997) as in Appendix A as fol 



lows: 



where 



K 



fe=i 



'Jkn 



7rfc^(g„|^fc,S^ 



*1 



Ef=l^.^(9nlMi,S,+*) 



(21) 



(22) 



The value of ^ in the term (S^ + ^') ^ in equation (21) plays an im 



portant role in this framework. When the entries of the uncertainty ^ are 
very small comparing to their corresponding entries in for a certain active 
GMM component k, the term S^, (S^ + ^)~^ tends to be the identity matrix, 



and MMSE estimate in (21) will be the observation qr„. When the entries of 
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the uncertainty ^' are very high comparing to their corresponding entries in 
Xlfc for a certain active GMM component k, the term "Ek (Sfc + ^)~^ tends 
to be a zeros matrix, and MMSE estimate will be the weighted sum of prior 
templates J2k=i Iknt^k- most cases 7fc„ tends to be close to one for one 
Gaussian component, and close to zero for the other components in a large 
dimension space. This makes the MMSE estimate in the case of high ^ to be 
one of the mean vectors in the prior GMM, which is considered as a template 
pattern for the valid observation. We can rephrase this as follows: When the 
uncertainty of the observations q is high, the MMSE estimate of relies 
more on the prior GMM of x. When the uncertainty of the observations 
q is low, the MMSE estimate of x, relies more on the observation q^. In 
general, the MMSE solution of x lies between the observation and one 
of the templates in the prior GMM. The term (S^ + *) controls the 
distance between Xn and g„ and also the distance between and one of the 
template ^ij. assuming that 7fc„ ~ 1 for a Gaussian component k. 



The model in equation ( 14 ) expresses the normalized columns of the gains 



matrix as a distorted image with a multiplicative deformation diagonal ma- 
trix. For the normalized gain columns n ^"n of G there is a deformation 

\\q 

n II2 

matrix E with log-normal distribution that is applied to the correct pattern 
that we need to estimate as follows: 

= (23) 

\\9n\\2 

The uncertainty for E is represented in its covariance matrix ^ . For the 
distorted matrix G we find its corresponding MMSE estimate for its log- 
normalized columns G. Another reason for working in logarithm domain is 
that, the gains are constrained to be nonnegative and the MMSE estimate 
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can be negative so the logarithm of the normahzed gains is an unconstrained 
variable that we can work with. The estimated weight patterns in G that 
are corresponding to the MMSE estimates for the correct patterns do not 
consider minimizing the NMF cost function in equation ([2]), which is still the 
main goal. We need the solution of G to consider the pattern shape priors 
on the solution of the gains matrix, and also considers the reconstruction 
error of the NMF cost function. To consider the combination of the two 
objectives, we consider using the regularized NMF. We add a penalty term 
to the NMF-divergence cost function. The penalty term tries to minimize 
the distance between the solution of log-normalized columns of with its 
corresponding MMSE estimate f{gn) as follows: 



log ■ 



/log 



exp / lo; 



\\9n\\2 \ \\9n\\2/ \\9n\\2 \ \ \\9n\\2 

The regularized IS-NMF cost function is defined as follows: 



(24) 



C = Dis{V\\BG) + aL{G), 



(25) 



where 



TV 



n=l 



9n 



\9n\ 



exp / log 



9r 



\9n\ 



(26) 



/ ^log ||^"|| ^ is the MMSE estimate defined in equation (21), and a is a 
regularization parameter. The regularized NMF can be rewritten in more 
details as 



-ll+aV 



' il 
(27) 



In equation (27), the MMSE estimate of the desired patterns of the gains 



matrix is embedded in the regularized NMF cost function. The first term 
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in (27), decreases the reconstruction error between V and BG. Given 
we can forget for a while the MMSE estimate concept that leaded us to our 



target regularized NMF cost function in (27) and see equation (27) as an 



optimization problem. We can see from (27) that, if the distortion measure- 
ment parameter ^ is high, the regularized nonnegative matrix factorization 
solution for the gains matrix will rely more on the prior GMM for the gains 
matrix. If the distortion parameter ^' is low, the regularized nonnegative 
matrix factorization solution for the gains matrix will be close to the ordi- 
nary NMF solution for the gains matrix without considering any prior. The 



second term in equation (27) is ignored in the case of zero uncertainty 
In case of high values of the second term encourages to decrease the 
distance between each normalized column u^"u in G with a corresponding 

II II 2 

prior template exp (^t^) assuming that 'jkn ~ 1 for a certain Gaussian com- 
ponent k. For different values the penalty term decreases the distance 
between each n ^"n and an estimated pattern that lies between a prior tem- 

Il^"ll2 

plate and ^"n . The term (log Tr^Hi fJ-k) in (27) measures how far each 

log-normalized column in the gains matrix is from a valid template ^i^- Under 
the assumption 'jkn ~ 1 for a certain Gaussian component k, the second term 



in (27) is also ignored when the observation log n ^"n form a valid pattern 

\\9 n\ \ o 



(log 



Q 

\q II 



l-ij^). How far each log-normalized column in the gains matrix is 



from a valid template decides how much influence the MMSE estimate prior 



term has to the solution of (27) for each observation. 



The multiplicative update rule for S in (27) is still the same as in equation 
The multiphcative update rule for G can be found by following the same 



procedures as in Virtanen (2007); Bertin et al. (2010); Grais and Erdogan 
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(2012b). The gradient with respect to G of the cost function V gC can be 
expressed as a difference of two positive terms VqC and VqC as follows: 



(28) 



The cost function is shown to be nonincreasing under the update rule (Vir 



tanen, 


2007 


Bertin et al. 


2010) 



(29) 



where the operations ® and division are element- wise as in equation (|4]). We 
can write the gradients as: 



VgC = VgDis + aVcLiG) 



(30) 



where 'WgL{G) is a matrix with the same size of G. The gradient for the IS- 
NMF and the gradient of the prior term can also be expressed as a difference 
of two positive terms as follows: 



and 



VgDis = V^Dis - VqDis, 



VgHG) =V+L{G) -VaHG). 



We can rewrite equations (28, 30) as: 



VgC = {V^Djs + aV+L(G)) - (V^/^/s + «VgI^(G)) 



(31) 



(32) 



(33) 



The final update rule in equation (29) can be written as follows: 



Vg^is + aVcHG) 
V+Dis + aV+L{G)' 
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(34) 



where 

V,D,s = B-^-B-^, (35) 

VrDis = B^ ^ and V+D/s = -B^^- (36) 

Note that, in calculating the gradients V(tL(G) and V(^L(G), the term 
is also a function of G. The gradients VqL(G) and V^L(G) are calculated 



in Appendix B. Since all the terms in equation (34) are nonnegative, then 



the values of G of the update rule (34) are nonnegative. 



9. The proposed regularized NMF for SCSS 

In this section, we are back to the single channel source separation prob- 
lem to find a better solution to equation (|9]). Figure [l] shows the fiow chart 
that summarizes all stages of applying our proposed regularized NMF method 
for SCSS problems. Given the trained basis matrices -Bi, B^ that were com- 
puted from solving ([s]) , and the trained gain prior GMM for each source from 
Section [7} we try to apply the proposed regularized NMF cost function in 
Section [s] to find better solution for the gain submatrices in equation (|9|. 
The bases matrix Btrain = [-61,-62] is still fixed here, we just need to update 
the gains matrix G in ([9]). The normalized columns of the submatrices Gi 
and G2 in equation ^ can be seen as deformed images as in equation (23) 



and their restored images are needed to be estimated. First, we need to learn 
the uncertainties parameters ^'i and ^'2 for the deformation operators Ei 
and E2 respectively for each image as shown in learning the uncertainties 
stage in Figure [Tj The columns of the submatrix Gi are normalized and 
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Figure 1: The flow chart of using regularized NMF with MMSE estimates under GMM 
priors for SCSS. The term NMF+MMSE means regularized NMF using MMSE estimates 
under GMM priors. 



their logarithm are calculated and used with the trained GMM prior param- 
eters for the first source to estimate ^'i iteratively using the EM algorithm 



in equations (15) to (20). The log-normalized columns "log n " of Gi can 



9 n 



be seen as in equations (15) to (20). We repeat the same procedures to 
calculate ^2 using the log-normalized columns of G2 and the prior GMM for 
the second source. The uncertainties ^'i and ^'2 can also be seen as measure- 
ments of the remaining distortion from one source into another source, which 
also depends on the mixing ratio between the two sources. For example, if 
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the first source has higher energy than the second source in the mixed signal, 
we expect the values of to be higher than the values in ^'i and vice versa. 
After calculating the uncertainty parameters for both sources ^'i and ^'2, 



we use the regularized NMF in (25) to solve for G with the prior GMMs for 



both sources and the estimated uncertainties ^'i and ^'2 as follows: 



C — Djs (Y 1 1 BtrainG 



R{G), 



where 



i?(G) = «iLi(Gi) + «2^2(G2) 



(37) 



(38) 



Li(Gi) is defined as in equation (26) for the first source, L2{G2) is for the 



second source, ai, and a2 are their corresponding regularization parameters. 



The update rule in equation (34) can be used to solve for G after modifying 
it as follows: 



G^G® 



(39) 



V%Dis + VlR{Gy 
where Vq-R(G) and VqR{G) are nonnegative matrices with the same size 
of G and they are combinations of two submatrices as follows: 



VaR{G) 



aiVcHGi) 
a2VaL{G2) 



V^R{G) 



aiVlL{Gi) 
a2V%L{G2) 



(40) 



where VqL{Gi),VqL{Gi),VqL{G2), and VqL{G2) are calculated as in 
section |8] for each source. 

The normalization of the columns of the gain matrices are used in the 
prior term R{G) and its gradient terms only. The general solution for the 



gains matrix of equation (37) at each iteration is not normalized. The nor- 



malization is done only in the prior term since the prior models have been 
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trained by normalized data before. Normalization is also useful in cases where 
the source signals occur with different energy levels from each other in the 
mixed signal. Normalizing the training and testing gain matrices gives the 
prior models a chance to work with any energy level that the source signals 
can take in the mixed signal regardless of the energy levels of the training 
signals. 



The regularization parameters in (38) have only one role. They are cho- 
sen to match the scale between the NMF divergence term and the MMSE 



estimate prior term in the regularized NMF cost function in (37). There is 
no need to change the values of the regularization parameters according to 



the energy differences of the source signals in the mixed signal as in Grais 



and Erdogan (2012b). Reasonable values for the regularization parameters 



are chosen manually and kept fixed in this work. Another main difference 



between the regularized NMF in Grais and Erdogan (2012b) that is shown in 



equation (11) and the proposed regularized NMF in this paper is related to 
the training procedures for the source models. In both works, the main aim 
of the training stage is to train the basis matrices and the gains prior GMMs 



for the source signals. In Grais and Erdogan (2012b), to match between the 
way the trained models were used during training with the way they were 
used during separation, the basis matrices and the prior GMM parameters 



were learned jointly using the regularized NMF cost function in (11). The 



joint training for the source models was introduced in Grais and Erdogan 



(2012b) to improve the separation performance. In joint training, after up- 
dating the gains matrix at each NMF iteration using the gain update rule for 



the regularized NMF in (11), the GMM parameters were then updated (re- 
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trained). Since, we needed to update (retrain) the GMM parameters at each 
NMF iteration, joint training slowed down the training of the source models 
in 



Grais and Erdogan (2012b). Another problem of using joint training is 



that, we had other regularization parameters during the training stage that 
needed to be chosen. Using joint training duplicates the number of the reg- 
ularization parameters that need to be chosen. Choosing the regularization 



parameters in Grais and Erdogan (2012b) was done using validation data. 



That means, in Grais and Erdogan (2012b) we had to train many source 
models (basis matrix and prior GMM) for different regularization parameter 
values. Then, we chose the best combination for the regularization parame- 
ter values in training and separation stages that gave the best results during 
the separation stage. In the case of using MMSE estimate regularization for 
NMF, we do not need to use joint training. In this paper, we do not need 



to consider solving the regularized NMF in (27) during the training stage to 



solve (|8]). In the training stage, the training data for each source is assumed 
to be clean data. Since the spectrogram of each source training data repre- 
sents clean source data, the NMF solution for the gains matrix can not be 
seen as a distorted image. Therefore, the deformation measurement parame- 
ter is a matrix of zeros. When = 0, the MMSE estimates prior 



term in (27) will disappear because 7^" — 1- Then, the regularized 



NMF (27) becomes just NMF. That means, we do not need to use the reg- 



ularized NMF during the training stage which is not the case in Grais and 



Erdogan (2012b). Here in the training stage, we just need to use IS-NMF to 



decompose the spectrogram of the training data into trained basis and gains 
matrices. After the trained gains matrix is computed, it is used to train the 
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prior GMM as shown in Sections |4] and [7j 



10. Source signals reconstruction 

After finding tlie suitable solution for the gains matrix G in Section |9} 



the initial estimated spectrograms Si and >S'2 can be calculated from (10) 
and then used to build spectral masks as follows: 

H, = H, = (41) 

Si + S2 Si + S2 

where the divisions are done element-wise. The final estimate of each source 

STFT can be obtained as follows: 

Si {t, f) = Hi {t, f) Y (t, f) , §2 {t, f) = H2 {t, f) Y {t, f) , (42) 

where Y (t, f) is the STFT of the observed mixed signal in equation (|5|, 
Hi (t, /) and H2 (t, /) are the entries at row / and column t of the spectral 
masks Hi and H2 respectively. The spectral mask entries scale the observed 
mixed signal STFT entries according to the contribution of each source in 
the mixed signal. The spectral masks can be seen as the Wiener filter as 
in Fevotte et al. (2009). The estimated source signals si(t) and S2(t) can 
be found by using inverse STFT of their corresponding STFTs S'l (t, f) and 

S2{tJ). 

11. Experiments and Discussion 

We applied the proposed algorithm to separate a speech signal from a 
background piano music signal. Our main goal was to get a clean speech 
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signal from a mixture of speech and piano signals. We simulated our algo- 
rithm on a collection of speech and piano data at 16kHz sampling rate. For 
speech data, we used the training and testing male speech data from the 
TIMIT database. For music data, we downloaded piano music data from the 



piano society web site (URL, 2009a). We used 12 pieces with approximate 
50 minutes total duration from different composers but from a single artist 
for training and left out one piece for testing. The PSD for the speech and 
music data were calculated by using the STFT: A Hamming window with 
480 points length and 60% overlap was used and the FFT was taken at 512 
points, the first 257 FFT points only were used since the conjugate of the 
remaining 255 points are involved in the first points. We trained 128 basis 
vectors for each source, which makes the size of ^speech and -B music matrices 



to be 257 x 128, hence, the vector dimension d = 128 in equation (13) for 
both sources. The mixed data was formed by adding random portions of the 
test music file to 20 speech files from the test data of the TIMIT database at 
different speech-to-music ratio (SMR) values in dB. The audio power levels 
of each file were found using the "audio voltmeter" program from the G.191 



ITU-T STL software suite (URL, 2009b). For each SMR value, we obtained 
20 mixed utterances this way. We used the first 10 utterances as a validation 
set to choose reasonable values for the regularization parameters aspeech and 
«music and the number of Gaussian mixture components K. The other 10 
mixed utterances were used for testing. 

Performance measurement of the separation algorithm was done using 
the signal to noise ratio (SNR). The average SNR over the 10 test utterances 
for each SMR case are reported. We also used signal to interference ratio 
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(SIR), which is defined as the ratio of the target energy to the interference 



error due to the music signal only (Vincent et al. , 2006). 



Table [T] shows SNR and SIR of the separated speech signal using NMF 
with different values of the number of Gaussian mixture components K and 
fixed regularization parameters Ospeech = «music = 1- The first column of the 
Table, shows the separation results of using just NMF without any prior. 



Table 1: SNR and SIR in dB for the estimated speech signal with regularization parameters 
•^speech = music = 1 ^^d different number of Gaussian mixture components K . 



SMR 
dB 


No 
SNR 


irior 
SIR 


K 
SNR 


= 1 
SIR 


K 
SNR 


= 4 
SIR 


K 
SNR 


= 8 
SIR 


K 
SNR 


= 16 
SIR 


K -- 
SNR 


= 32 
SIR 


-5 


2.88 


4.86 


3.31 


5.71 


3.61 


6.58 


4.24 


8.07 


4.76 


10.07 


4.27 


8.39 





5.50 


8.70 


5.74 


9.31 


5.90 


9.99 


6.32 


11.61 


6.45 


13.02 


6.54 


12.42 


5 


8.37 


12.20 


8.46 


12.40 


8.55 


12.98 


8.74 


14.13 


8.73 


15.62 


8.69 


14.51 



As we can see from the Table, the proposed regularized NMF algorithm 
improves the separation performance for challenging SMR cases compared 
with using just NMF without priors. Increasing the number of Gaussian 
mixture components K improves the separation performance until K = IQ. 
From the shown results, K = IQ seems to be a good choice for the given 
data sets. The best choice for K usually depends on the nature and the 
size of the training data. For example, for speech signal in general there are 
variety of phonetic differences, gender, speaking styles, accents, which raises 
the necessity for using many Gaussian components. 



32 



Comparison with other priors 

In this section we compared our proposed method of using MMSE es- 
timates under GMM prior on the solution of NMF with two other prior 
methods. The first prior is the sparsity prior and the second prior is enforced 
by maximizing the loghkehhood under GMM prior distribution. 

In the sparsity prior, the NMF solution of the gains matrix was enforced 



to be sparse (Virtanen and Cemgil, 2009 Schmidt and Olsson 2006). The 



sparse NMF is defined as 



C (G) = Djs {Y\\BG) + XY, G^,n, 



(43) 



where A is the regularization parameter. The gain update rule of G can be 
found as follows: 



G^G 



(BG) 



B 



T 1 



BG 



+ A 



(44) 



Enforcing sparsity on the NMF solution of the gains matrix is equivalent 
to model the prior of the gains matrix using exponential distribution with 



parameter A (Virtanen and Cemgil, 2009). The update rule in equation (44) 



is found based on maximizing the likelihood of the gains matrix under the 
exponential prior distribution. 

The second method of enforcing prior on the NMF solution is by using 
GMM gain prior dGrais and Erdoganf |2012a||b| . The NMF solution for the 
gains matrix is enforced to increase its log-likelihood with the trained GMM 
prior as follows: 

C = DisiY\\BG)~R2{G), (45) 
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where R2{G) is the weighted sum of the log-hkehhoods of the log-normahzed 
columns of the gains matrix G. R2{G) can be written as follows: 

2 

R2{G) = Y,V^nG^), (46) 

1=1 

where T{Gi) is the log-likelihood for the submatrix Gi for source i. 

In sparsity and GMM based log-likelihood prior methods, to match be- 
tween the used update rule for the gains matrix during training and sepa- 
ration, the priors were enforced during both training and separation stages. 
In sparse NMF we used sparsity constraints during training and separation 
stages. In regularized NMF with GMM based log-likelihood prior we trained 



the NMF bases and the prior GMM parameters jointly as shown in Grais 



and Erdogan (2012b). 



In the sparse NMF case, we got best results when A = 0.0001 for both 
sources in the training and separation stages. In the case of enforcing the 



gains matrix to increase the log-likelihood under GMM prior (Grais and 



Erdogan, 2012b) we got the best results when r] = 1 in the training and 



7] = 0.1 in the separation stage. The number of Gaussian components was 
K = 4 for both sources. It is important to note that, in the case of using 
MMSE under GMM prior there is no need to enforce prior during training 
since the uncertainty measurements during training are assumed to be zeros 
since the training data are clean signals. When the uncertainty is zero, then 
the regularized NMF in case of MMSE under GMM prior is the same as the 
NMF cost function, then the update rule for the gains matrix in the training 
stage is the same as the update rule in the case of using just NMF. 

Figures |2] and [3] show the SNR and SIR for the different type of prior 
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SNR for different gain priors for regularized NMF 




-5 5 

SIVIR (dB) 



Figure 2: The effect of using different prior models on the gains matrix. The black line 
for using no prior case, the red line for using the exponential distribution prior, the green 
line is for maximizing the gains matrix likelihood with the GMM prior, and the blue line 
is for using MMSE under GMM as a prior 

models. The black line shows the separation performance in the case of no 
prior is used. The red line shows the performance for the case of using sparse 
NMF. The green line shows the performance in the case of enforcing the gains 
matrix to increase its likelihood with the prior GMM. The blue line shows 
the separation performance in the case of using MMSE estimate under GMM 
prior that is proposed in this paper. As we can see, the proposed method 
of enforcing prior on the gains matrix using MMSE estimate under GMM 
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SMR 



Figure 3: The effect of using different prior models on the gains matrix with the same 
color map of the previous figure. 

prior gives the best performance comparing with the other methods. The 
used MMSE estimates prior in this work gives better results than the GMM 



hkehhood method (Grais and Erdogan, 2012b) because of the measurements 



of the uncertainties in the MMSE under GMM case. The uncertainties work 
as feedback measurements that adjust the needs to the prior based on the 
amount of distortion in the gains matrix during the separation stage. 

Comparing the relative improvements in dB that we got in this paper with 
the achieved improvements in other works (Wilson et al. , 2008b|[a Virtanen 



and Cemgil 


2009; 


Virtanen, 


2007 
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paper can be considered to be high. 



12. CONCLUSION 

In this work, we introduced a new regularized NMF algorithm. The 
NMF solution for the gains matrix was guided by the MMSE estimate un- 
der a GMM prior where the uncertainty of the observed mixed signal was 
learned online from the observed data. The proposed algorithm can be ex- 
tended for better measurements of the distortion in the observed signal by 



embedding more parameters in equation ( 14 ) that can be learned online from 
the observed signal. 
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APPENDIX A 

In this appendix, we show the MMSE estimate and the parameter ^ 



learning similar to Rosti and Gales (2001), Ghahramani and Hinton (1997), 



and Rosti and Gales (2004). Assume we have a noisy observation y as shown 



in the graphical model in Figure [4], which can be formulated as follows: 

y = x + e, (47) 




Figure 4: The graphical model of the observation model. 

where e is the noise term, and x is the unknown underlying correct signal 
which needs to be estimated under a GMM prior distribution: 

K 

P{x) = J2'"'^^^\^^k,'^k), (48) 

fc=i 
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the error term e has a Gaussian distribution with zero mean and diagonal 
covariance matrix 

p(e) = ^(e|0,*). (49) 

The conditional distribution of y is a Gaussian with mean x and diagonal 
covariance matrix 

p{y\x,k) = .yi^{y\x,^) . (50) 

The distribution of y given the Gaussian component is a Gaussian with 
mean and diagonal covariance matrix + 



p{y\k) = yi^{y\n^,^k + ^)- 



(51) 



The marginal probability distribution of is a GMM: 

K 



p{y) = J2 T^k^^ivlf^k, + *) 



(52) 



k=l 



where the expectations E (x) = E{y), and E (e) = 0. Note that, this 
observation model has some mathematical similarities but different concepts 
with factor analysis models assuming the load matrix is the identity matrix 



(Rosti and Gales 2001 Ghahramani and Hinton 1997 Rosti and Gales 



2004 Jordan and Bishop). 

The MMSE estimate of x can be found by calculating the conditional 
expectation of x given the observation y. Given the Gaussian component 
k, the joint distribution of x and y is a multivariate Gaussian distribution 



with conditional expectation and conditional variance as follows (Rosti and 



Gales , 


2001 


Leon- Garcia 


1994 



E{x\y,k) = fi^ 



^kxy^kl (y-t^k) 



(53) 
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var {x\y, /c) = - ^ S^, 



xy 



(54) 



we know that 



S 



(55) 



and 



^kxy = GOV (CC, y) 

^E{xy^)-E{x)E{y^) 

^E[x {x^ + e"^)] -E{x)E (y^) 

= E {xx'^) +E{x)E (e^) - E (x) E (t/^) 

= var (x) + E(x)E (a;^) - E (x) E (y^) 



The conditional expectation given the Gaussian component k of the prior 
model is 



E {x\y, k) = Hk + Sfc (Sfe + *) '{y- /xj 



We also can find the following conditional expectation given only the obser- 
vation y as follows: 



var {x) = Sfc. 



(56) 



= Xk- 



(57) 



44 



K 



E{x\y) = J2E{k\y)E{x\y,k) 

k=l 
K 

= ^7kE {x\y,k) 



k=l 



X, 



where 



E {k\y) = —j^ — = 7fc. 

E,=i7riP(?/b) 



(58) 



(59) 



From equations (57, 58, 59) we can write the final MMSE estimate of x 



given the model parameters as follows: 



K 



^ = ^lk[^ik + Sfc (Sfc + *) \y- Aifc)] . 



(60) 



fe=i 



We need also to find the following sufficient statistics to be used in esti- 
mating the model parameters: 



var {x\y, A;) = - S,, (S^ + ^) ' S 



1 

k 1 



(61) 



E {xx'^\y^ k) = var {x\y, k) + E {x\y, k) E {x\y, kY 



Sfc - Sfe (Sfc + *) 



-1 



(62) 
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and 



K 



E {xx^\y) =Y.E {k\y) E {xx'^ly, k) 

k=l 
K 

= ^ikE {xx'^\y,k) 

k=l 
K 



k=l 

R. 



(63) 



Parameters learning using the EM algorithm 

In the training stage, we assume we have clean data with e = 0. The prior 
GMM parameters vr, ^t, S are learned as regular GMM models. The only pa- 
rameter that need to be estimated is which is learned from the deformed 
signal in the paper. The parameter ^' is learned iteratively using max- 
imum likelihood estimation. Given the data points q = Qi, q2, ■■, 9„, ^at, 
and the GMM parameters, we need to find an estimate for ^. We follow 



the same procedures as in Rosti and Gales (2001), Ghahramani and Hinton 



(1997), and Rosti and Gales (2004). 



Lets rewrite the sufficient statistics in equations (59, 57, 60, 62, 63) af 



ter replacing x with z (to avoid confusion between calculating MMSE and 
training the model parameters) as follows: 

7rfc^(qr„|^„Sfc + *) 



'ykn 



Ef=iVr,^(gJ^„S, + *) 



(64) 



Zkn = E {z\q„, k) = ^if, + Sfc (Sa 



*) ^Qn-tJ-k) 



(65) 
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K 



Zn^ E{z\q^) = ^-fknZkn, (66) 



k=l 



Rkn = E {zz'^lq^,, A;) = Sfe - (S^ + *) ' + (67) 

and 

K 

Rn^E {zZ^\q^) = Yl ^knRkn. (68) 



k=l 



The complete log-likelihood can be written in a product form as follows: 

N K 



I {q, z, k\n, S, TT, *) = log Yl Y[p{k)p{z\k)p{qjz, k), 

n=l k=l 
N K 

= log n n [^k^i^ll^k, Sfe) ^{qjz, *)]'= , (69) 



n=l k=l 



N K N K N K 



I {q, Z, k\^l, S, TT, *) = ^ ^ /c log 7rk+J2 Yl ^ -^(^iMfc, Sfe) + ^ Y ^ -^(^nl^, *) ■ 

n=l k=l n=l k=l n=l fe=l 

(70) 

The conditional expectation of the complete log hkehhood, which is con- 
ditioning on the observed data q„ can be written as: 



N K N K 



Q = E E ^^\ln) log^fe + E E (^l9n) Eq^ (log^(z|M„ Sfc) kn) 

n=l fc=l n=l fc=l 

+ EE^9n (^l9J^q„ (log^(q„|z,*) |qj , (71) 



n=l fe=l 

given that 



(^l9n) = ^ ^ ^,T,^ " 
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We can write the complete log-likelihood as follows: 

N K N K 

Q = '^'^lkn'^Og7lk + '^'^lknEq^ (log^(z|Aifc,5]fc) |g„) 



n=l k=l n=l k=l 

N K 



+ E E^'^-^^^ i^og ^{qjz, *) |gj . (73) 

n=l k=l 

For the parameter we need to maximize the third part of equation 



(73) with respect to 

N K 



Qqu =EE^'^"^9„ (iog^(9„i2,*)igj 

n=l k=l 

= E E ^knEq I log 1 ^ exp | — ^ (g„ - zf (g„ - z) \ \q^, k 

= E E ^kuEq^ i — log (27r) - - log |*| - - (g„ - 2)^ (g„ - z) |qr„, A; 

n=l k=\ ^ 

(74) 

the derivative of Qq^ with respect to is set to zero: 

S% = E E ^'^"^g^ - ^ - ^) - l^n, ^ = 0, (75) 

n=l fc=l ^ ^ 



Af N K N K 



* E E ^-^^ = E E ^knQnQl - E ^" E ^^'^^Q^ (^1^", 



n=l fc=l n=l A:=l n=l fc=l 

^qn^lknEq^{z\q^^,kf \ +EE^^"^9„ (22^l9„,^) 

n=l k=l / n=l /c=l 

(76) 

we know that 

N K K 

E E '^^^ ^ ^ ^^"^ E '^'^^ ^ 

n=l fc=l fc=l 
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then 



and 



N 



K 



N 



T 
n ' 



n=l k=l 
N K 



n=l 



n=l k=l 

We can use the values of X]f=i IknEq^ (^|g„> k) and X]f=i IknEq^ [zz^\q^, k) 



from equations (66, 68) to find the estimate of ^ as follows: 



f 1 ^ 

I n=l 



(77) 



where the "diag" operator sets all the off-diagonal elements of a matrix to 
zero. 

APPENDIX B 

In this appendix, we show the gradients of the penalty term in the reg- 
ularized NMF cost function in section 2.1. To calculate the update rule for 
the gains matrix G, the gradients V^L(G) and V^L(G) are needed to be 
calculated. Lets recall the regularized NMF cost function 



C{G) = Dis{V\\BG) + aL{G), 



where 



N 



9n 



- exp (/ {g^^)) 



n\\2 



/ {9n) = E ^^r. + (S. + *)-' log ^ - ^Ji, 

fc^l L V \\9n\\2 / 



(78) 



(79) 



(80) 
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and 



Since the training data for the GMM models are the logarithm of the normal- 



9 71 



(81) 



ized vectors, then the mean vectors of the GMM are always not positive, also 
the values of log n ^"n are also not positive, and is always nonnegative. 

II 2 

Let gr„ = X, and its component a is gr„^ = Xa, and f{gn) = f{x). We can 



write the constraint in equation (|79|) as: 

L(x) = 



X 



\x\ 



exp (fix)) 



(82) 



The a component of the gradient of L{x) is 



dXa 



X 



\x\ 



\x\ 



Vf{Xa)exp{f{Xa)) 



VLiXa) 



(83) 



which can be written as a difference of two positive terms 



VL{Xa) = V^L{Xa) L{Xa). 



(84) 



The component a of the gradient of / {x) can be written as a difference of 
two positive terms: 

df{x) 



dXa 



V^f{Xa)-V-fiXa) 



(85) 



The component a of the gradient of L (x) in equation (84) can be written as: 

V+L{xa)^2\j^(^+exp{f{xa))V-fixa)) +exp(/(xj) ( ^ + exp (/(x,)) V+/(x,) 

(86) 
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and 

V-iK) = 2 I ^ + exp ifixa)) y+fixa)] + exp (/(x,)) ^ ^ 



+ exp(/(a;,))V-/K) 
(87) 



We need to find the values of V^/(xa) and V /(xq). Note that, the term 
Sfc (Sfc + ^)~^ forms a diagonal matrix. 
Let 



then /(cc) in equation (80) can be written as: 

K 



f{x) = Y,lk{x)H{x). 



k=l 



The gradient of f{x) in equation (89) can be written as: 



K 



k=l 



where 



Tik^l log^|Atfc,Sfc + * 



Mk{x) 



(89) 



(90) 



(91) 



We can also write the gradient components of H{xa) and 7A;(a3) as a difference 
of two positive terms 



VH{Xa) = V+H{Xa) - V-H{Xa), 



(92) 



and 



V7fc(Xa) = V+7fc(Xa) - V -fkiXa). 



(93) 
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The gradient of f{xa) in equations (85, 90) can be written as: 



K 



(94) 



k=l 



where 



K 



V-/(a;a) = J2 bk{x)V-H{Xa) + H-{Xa)V+lk{Xa) + H^{Xa)V-^k{Xa)\ , 

(95) 



k=l 



Xa 
-1 



O'l^) II 1 1 2 ' 



and H{xa) can be written as a difference of two positive terms: 



where 



and 



H-ixr, 



H{Xa)=H+{Xa)-H-{Xa), 



2 J 



We can rewrite 'jk{x) in equation (91) as: 

^ iV,(x)' 
note that 'jkix), Mk{x), Nk{x) > 0. 

The component a of the gradient of 7fe(a;) can be written as: 

Nkix)VMkixa) - Mfc(a;)VArfc(xa) 



V7fc(a;a) 



(96) 
(97) 

(98) 

(99) 
(100) 

(101) 



N',{x) 



(102) 
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We can write the gradients of Mk{x) and Nk{ clS cL difference of two positive 
terms 



and 



K 



K 



k=l 



k=l 



Tfie gradient of 7fc(xa) in equation (93) can be written as: 



V+7fc(x,) 



(103) 



(104) 



(105) 



iVfc(a;)VM,-(xJ + Mfc(a;)Ef=iV+M,(x. 



V 7fc(Xa) 



where 



V+Mk{Xa) = Mk{x) (Sfc,, + 



,-1 



'1 1 fJ,f^ Xa 

log 



and 



V-MkiXa) = Mk{x) (Sfc^^ + *,,) ' 



1 -^a 



(106) 



(107) 



(108) 



After finding V"'"7fc(xa), and V 'jkixa) from equations (105, 106), andV'^ H{xa 



and V H{xa) from equations (96, 97), we can find the gradients V^/(x, 



and V f{xa) in equations (94, 95), which complete our solution for V"'"-L(xa), 



and V L{xa) in equations (86, 87) 
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